On the Size of Full Element-Indexes for XML Keyword Search
Authors: D. Atilgan, I.S. Altingovde, and Ö. Ulusoy
Abstract
We show that a full element-index can be as space-efficient as a direct index with Dewey ids, after compression using typical techniques.

1 Full Element-Index for XML Keyword Search

Keyword search is a crucial operation that has to be supported on XML data. Earlier work approaching this problem from an information retrieval (IR) perspective essentially considers disjunctive query semantics (e.g., see [2]), whereas work from the database (DB) perspective mainly concentrates on Web-style conjunctive semantics (e.g., [1,4]). Typically, an inverted index is the preferred data structure for XML keyword search in both communities. In this respect, a straightforward approach is to index each element in the XML data as a separate document, formed of the text contained in the element itself and in all of its descendants [2]. This is called a full (element-)index. While a full index can support both disjunctive and conjunctive keyword search semantics, the nested structure of XML data poses some efficiency challenges for its use in practice [1]. Most crucially, in a full index, a term t that is directly contained in an element at depth n is indexed n times, i.e., once for the element itself and once for each of its ancestors (see Figure 1). This implies a non-trivial overhead in terms of storage space and query processing time.

To cope with these problems of a full element-index, a key decision is to index only the direct textual content of each element, excluding the contents of its descendants. This so-called direct index removes the redundancy inherent in the full index and allows disjunctive query processing with a certain level of success (e.g., see [2]). However, for Web-style conjunctive query processing, such a direct index (in contrast to a full index) needs to explicitly capture the ancestor-descendant relationships among the elements. To this end, one of the most widely accepted solutions is to label each element with a Dewey ID [1,4].
In the Dewey ID representation, the label of a node encodes the path from the document root down to that node, so that ancestor-descendant relationships between nodes can be determined directly from the labels (see Figure 1).

Author note: Currently affiliated with Google Ireland.
Footnote 1: In this study, we focus on simple keyword queries without structural constraints.
R. Baeza-Yates et al. (Eds.): ECIR 2012, LNCS 7224, pp. 556–560, 2012. © Springer-Verlag Berlin Heidelberg 2012

Fig. 1. An example XML tree and the corresponding full and Dewey-encoded indexes

In this paper, we question one of the most important arguments against a full element-index, namely, index size. We argue that, although a raw full index may be larger than a Dewey-encoded index, the size disadvantage may disappear after compression. Our claim is based on two key observations. First, the upper-level nodes of an XML document are usually shared by many lower-level nodes, which reduces redundancy in posting lists (e.g., in Fig. 1, since nodes 3 and 4 share the ancestor 5, the posting list of the term “language” needs to include only two ancestor nodes, namely 5 and 7). Second, for typical tree-traversal orders, elements in an ancestor-descendant relationship are assigned very close ids, yielding smaller id gaps and a higher compression ratio for a full index. In what follows, we support these claims with a formal discussion and experimental results for three large datasets and different compression methods.

2 A Formal Comparison of Space Complexities

We provide a formal analysis of the space complexities of the full element-index and the Dewey-encoded index. Without loss of generality, we use the well-known Elias-γ compression method [3] and restrict our discussion to compressing element ids (as they occupy the majority of the space in an index). We assume that the input XML tree to be indexed is a complete k-ary tree of depth d.
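As a concrete illustration of the two index layouts compared in this section, the following minimal Python sketch builds both of them for a tiny document. The document, the function name `build_indexes`, and the choice of postorder numbering for element ids are assumptions made for illustration only; the toy tree is not the one shown in Fig. 1.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical toy document (NOT the tree of Fig. 1).
DOC = (
    "<book>"
    "<title>xml search</title>"
    "<chapter><sec>query language</sec><sec>markup language</sec></chapter>"
    "</book>"
)


def build_indexes(xml_text):
    """Return (full_index, dewey_index) for a small XML string.

    full_index:  term -> sorted postorder ids of every element whose
                 subtree contains the term (element plus ancestors).
    dewey_index: term -> Dewey IDs of the elements that contain the
                 term directly.
    """
    root = ET.fromstring(xml_text)
    full = defaultdict(set)
    dewey = defaultdict(list)
    next_id = [0]  # last postorder id handed out (ids start at 1)

    def visit(elem, path):
        # Direct text of an element: its own text plus the tails of its children.
        direct = (elem.text or "").split()
        for child in elem:
            direct += (child.tail or "").split()
        for term in sorted(set(direct)):
            dewey[term].append(".".join(map(str, path)))

        # Collect the terms of the whole subtree, visiting children first.
        subtree_terms = set(direct)
        for pos, child in enumerate(elem, start=1):
            subtree_terms |= visit(child, path + [pos])

        # Postorder: the element receives its id after all of its descendants.
        next_id[0] += 1
        for term in subtree_terms:
            full[term].add(next_id[0])
        return subtree_terms

    visit(root, [1])
    return {t: sorted(ids) for t, ids in full.items()}, dict(dewey)


full_index, dewey_index = build_indexes(DOC)
print(full_index["language"], dewey_index["language"])
```

For this toy input, the full-index list for "language" holds four element ids, with the two shared ancestors stored once each, while the Dewey-encoded list holds two labels of three components each.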
Space Complexity of the Dewey-Encoded Index ($I_D$). The Dewey ID of a node at level m consists of m integers (where 1 ≤ m ≤ d). That is, a node at level m is represented with the Dewey ID $a = a_1.a_2.a_3.\ldots.a_m$. In the worst case, only the leaf nodes of an XML tree include text, i.e., m = d. This is a viable assumption, since in a k-ary tree about (k − 1)/k of the nodes are indeed leaves. As each $a_i$ is at most k, by using Elias-γ compression, a Dewey ID can be represented by at most $d(2 \lg k + 1)$ bits. Let us assume that the posting list of a term t in the direct index $I_D$ consists of e elements. Then, the compressed size of the posting list of t would be at most $e \times d(2 \lg k + 1)$ bits. Hence, the space complexity of a Dewey-encoded index is $O(ed \lg k)$.

Footnote 2: Recall that an integer x is encoded in $2 \lg x + 1$ bits (more precisely, $2\lfloor \lg x \rfloor + 1$ bits) in Elias-γ compression [3].

Table 1. Dataset characteristics

Dataset     No. of Docs.   No. of Elem.   Max. Depth   Avg. Depth   Max. Fan-out
DBLP        1              4.9 million    4            1.9          479,426
Wikipedia   659,388        7.4 million    47           2.6          5,621
XMark       1              1.6 million    11           4.5          10,000

Space Complexity of the Full Element-Index ($I_F$). Without loss of generality, the nodes of the input XML tree T are labeled according to some tree traversal order of T. If T is a complete k-ary tree, these labels are at most the number of nodes in T, which is $K = 1 + k + k^2 + \ldots + k^{d-1} = (k^d - 1)/(k - 1)$. Assume that there are e elements in a term t's posting list in the Dewey-encoded index $I_D$, and $e'$ elements in the corresponding list in the full index $I_F$. As before, we also assume that all of the elements in a posting list of $I_D$ are leaf elements at depth d. To compare the sizes of the Dewey-encoded and full indexes, we have to estimate $e'$. We begin by analyzing two extreme cases:

(i) If none of the leaf elements has a common ancestor except the root node, then they would have at most $e(d - 1) + 1$ distinct ancestors. In this case, the corresponding posting list in $I_F$ would have at most $e' = e(d - 1) + 1 + e = ed + 1$ elements.
(ii) If all ancestors of these e leaf nodes are common, the leaf elements would have d − 1 ancestors in total and the corresponding posting list in $I_F$ would have $e' = e + d - 1$ elements.

However, both of these cases are quite rare. Therefore, we carry out an average-case analysis and estimate a decay factor, α, which captures the proportional decrease in the number of ancestor nodes between consecutive levels. Assume that there are $e_\ell$ nodes at depth $\ell$ whose subtrees contain term t, and that these elements have $e_{\ell-1} = \alpha e_\ell$ ancestors at depth $\ell - 1$. Note that α ≤ 1 and hence $e_{\ell-1} \le e_\ell$. Let us consider a practical case where α ≤ 1/2. In this case, since $e + e/2 + e/4 + \ldots + e/2^{d-1} \le 2e$, $e'$ is on the order of e.

Recall that, for $I_F$, element id gaps are compressed instead of the actual element ids. Since the element ids are between 1 and K and there are $e'$ elements in t's list in $I_F$, the average gap would be $K/e'$. Using the Elias-γ method, a gap can be encoded in about $2 \lg (K/e')$ bits. Since $e' \le 2e$, the compressed size of t's list with $e'$ elements is:

$$e' \cdot 2 \lg \frac{K}{e'} = e' \cdot 2 \lg \frac{k^d - 1}{(k - 1)\, e'} < 2e \cdot 2 \lg \frac{k^d - 1}{2e} < 4ed \lg k = O(ed \lg k).$$

Hence, we conclude that for a typical XML tree where α ≤ 1/2, the space complexities of the full and Dewey-encoded indexes are both $O(ed \lg k)$.
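The analysis above can be explored numerically with a small sketch. The Python code below is only an illustration under stated assumptions: the leaves that contain the term are drawn uniformly at random (real posting lists are likely more clustered), element ids follow a preorder traversal, and the names `gamma_bits`, `subtree_size`, `leaf_positions` and `compare_sizes` are hypothetical helpers, not part of the paper. It counts Elias-γ bits for the Dewey-encoded list (every component of every Dewey ID) and for the full-index list (gaps between sorted element ids).

```python
import math
import random


def gamma_bits(x):
    """Length in bits of the Elias-gamma code of a positive integer x."""
    if x < 1:
        raise ValueError("Elias-gamma encodes positive integers only")
    return 2 * (x.bit_length() - 1) + 1  # 2*floor(lg x) + 1


def subtree_size(k, levels):
    """Number of nodes in a complete k-ary tree with `levels` levels (k >= 2)."""
    return (k ** levels - 1) // (k - 1)


def leaf_positions(index, k, d):
    """Child positions (each in 1..k) on the root-to-leaf path of leaf `index`."""
    positions = []
    for _ in range(d - 1):
        index, r = divmod(index, k)
        positions.append(r + 1)
    return positions[::-1]


def compare_sizes(k, d, e, seed=0):
    """Return (dewey_bits, full_bits, e_prime) for e random leaves.

    Assumes a complete k-ary tree of depth d with the root at level 1,
    so a leaf has a Dewey ID of d components, the first being 1.
    """
    rng = random.Random(seed)
    leaves = rng.sample(range(k ** (d - 1)), e)

    dewey_bits = 0
    full_ids = set()
    for leaf in leaves:
        positions = leaf_positions(leaf, k, d)
        # Dewey-encoded index: encode each component of the leaf's Dewey ID.
        dewey_bits += sum(gamma_bits(a) for a in [1] + positions)
        # Full index: collect the preorder ids of the leaf and all its ancestors.
        node_id = 1  # the root
        full_ids.add(node_id)
        for level, pos in enumerate(positions, start=2):
            # Skip the (pos - 1) earlier sibling subtrees, then step into the child.
            node_id += 1 + (pos - 1) * subtree_size(k, d - level + 1)
            full_ids.add(node_id)

    ids = sorted(full_ids)
    gaps = [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]
    full_bits = sum(gamma_bits(g) for g in gaps)
    return dewey_bits, full_bits, len(ids)


dewey_bits, full_bits, e_prime = compare_sizes(k=4, d=6, e=50)
print(dewey_bits, full_bits, e_prime)
```

The value of `e_prime` returned here always lies between the two extreme cases discussed above, $e + d - 1$ and $ed + 1$. Which of the two bit counts is smaller depends on k, d, e and on how strongly the selected leaves cluster, so the sketch is meant for experimenting with the bound rather than for reproducing the reported results.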